Record: Order-Adaptive BackoffMixer (mean val_bpb=0.5440)#825

Open
hypery11 wants to merge 1 commit into openai:main from hypery11:submission/2026-03-26_final_champion

Conversation

@hypery11

Results

| Seed | val_bpb | Eval time |
|------|---------|-----------|
| 42   | 0.5437  | ~391s     |
| 1337 | 0.5450  | ~391s     |
| 2024 | 0.5434  | ~391s     |
| Mean | 0.5440  |           |
| Std  | 0.0008  |           |
  • Artifact: ~16.0 MB
  • Train: 600s on 8xH100 SXM
  • Eval: ~391s (well under 600s)

Method

11-layer transformer (512d, 8/8 full MHA, XSA-all, LeakyReLU(0.5)^2, 3.5x MLP). Order-adaptive entropy-gated BackoffNgramMixer with per-order entropy thresholds. Score-first, backward-looking, deterministic.
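To make the gating idea concrete, here is a minimal sketch of entropy-gated backoff mixing. It is an illustration only, not the PR's actual code: the function names, the sigmoid gate form, and the `sharpness` parameter are assumptions, and the real mixer operates on hashed n-gram counts rather than dense distributions.

```python
import numpy as np

def entropy(p, eps=1e-9):
    """Shannon entropy (nats) of a probability vector."""
    return -np.sum(p * np.log(p + eps))

def entropy_gated_mix(p_base, ngram_dists, thresholds, sharpness=4.0):
    """Mix a base LM distribution with per-order n-gram distributions,
    highest order first, gating each order on the base model's entropy.

    p_base      : (V,) base LM probabilities for the next token
    ngram_dists : list of (V,) distributions, ordered low -> high order
    thresholds  : per-order entropy centers for the sigmoid gates
    """
    h = entropy(p_base)
    p = p_base.copy()
    # Back off from the highest order: an order only contributes when
    # the base model's entropy exceeds that order's threshold.
    for dist, center in zip(reversed(ngram_dists), reversed(thresholds)):
        gate = 1.0 / (1.0 + np.exp(-sharpness * (h - center)))
        p = (1.0 - gate) * p + gate * dist
    return p / p.sum()
```

When the base model is confident (low entropy), the gates stay closed and the n-gram caches barely perturb the distribution; when it is uncertain, higher orders take over.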

Acknowledgments

Huge thanks to the incredible community that made this possible:

This competition has been an amazing collaborative experience. Every improvement here builds on ideas shared openly.

  • 8xH100 SXM, train <=600s
  • Eval <=600s (391s)
  • Artifact <=16MB
  • 3-seed validation (std 0.0008)

Seeds: 0.5437 / 0.5450 / 0.5434 (std 0.0008).
Order-adaptive entropy gating + BackoffNgramMixer.
~16MB artifact. Train 600s, eval 391s.
@MatoTeziTanka

MatoTeziTanka commented Mar 26, 2026

Really impressive work — the order-adaptive entropy gating with per-order thresholds is a thoughtful design, and the 3-seed consistency (std 0.0008) is excellent. The acknowledgments section is also great to see — this competition has been genuinely collaborative.

One thing to flag: checking the log output, it looks like seeds 42 and 2024 may exceed the 16,000,000 byte artifact cap:

  • Seed 1337: 15,948,371 bytes ✅
  • Seed 42: ~16,022,243 bytes (over by ~22K)
  • Seed 2024: ~16,030,231 bytes (over by ~30K)

We ran into the exact same issue on our PR #769 seed 42 (over by 25,731 bytes) and had to rerun with tighter quantization. It's a subtle one — the submission.json may not reflect the per-seed sizes accurately.

Might be worth double-checking the individual seed artifact sizes against the 16,000,000 limit before the maintainers review. The fix for us was minor — just tightening the compression/quantization slightly to get the headroom.
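A quick local check against the cap, using the per-seed sizes quoted from the logs above (the `check_size` helper is hypothetical; in practice you would feed it `os.path.getsize()` of each seed's artifact file):

```python
ARTIFACT_CAP = 16_000_000  # bytes, per the track rules

def check_size(size_bytes, cap=ARTIFACT_CAP):
    """Return (ok, margin): margin > 0 means headroom, < 0 means over."""
    return size_bytes <= cap, cap - size_bytes

# Per-seed artifact sizes reported in this thread's logs:
for seed, size in [(1337, 15_948_371), (42, 16_022_243), (2024, 16_030_231)]:
    ok, margin = check_size(size)
    verdict = "OK" if ok else f"OVER by {-margin:,}"
    print(f"seed {seed}: {size:,} bytes -> {verdict}")
```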


Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.

@MatoTeziTanka

Circling back on this one with an updated finding, since @valerio-oai ruled on the underlying mechanism after my first comment.

Compliance flag — same disallowed pattern as PR #779.

@valerio-oai disallowed PR #779 (deanbrr) on 2026-03-27 (comment 4145781641) specifically for "hashed n-gram caches, which do not renormalize correctly / correctly reweight the LM's token distribution, look ahead to the target token to mix probabilities and therefore leak eval tokens." The mechanism is spelled out in the follow-up comment 4146407380: hashing the ground-truth token into the lookup key only reweights the correct token, and in the hash-collision limit drives P(correct) toward 1 regardless of the data, giving arbitrarily low BPB without real compression.

Looking at records/track_10min_16mb/2026-03-26_OrderAdaptive_BackoffMixer/train_gpt.py, the BackoffNgramMixer (L39–145) is a port of #779's mixer with an entropy-gating delta on top, and it uses the same target-in-key hashing pattern at:

  • L76 (update): full_key = ((ctx_hash ^ (tgt * self.primes[cw])) & mask).astype(np.int64) — hashes target tgt into the bucket
  • L78: np.add.at(self.full_counts[oi], full_key, 1) — increments the target-conditioned count
  • L119 (mix_and_score): full_key = ((ctx_hash ^ (y_np.astype(np.uint64) * self.primes[cw])) & mask).astype(np.int64) — same hash with y_np as the target
  • L121: full_c = self.full_counts[oi_rev][full_key.reshape(-1)] — looks up the target-conditioned count
  • L1091: mixer.update(val_tokens[chunk_start_tok:chunk_end_tok + 1]) — also still has the +1 boundary leak I flagged on #779 (which was fixed there in commit c58742a after my review; this PR branched from pre-fix code).

Under @valerio-oai's #779 ruling, this is the same Rule 1 violation (Issue #1017 condition 1 — p_t may depend only on the artifact and x_1...x_{t-1}). The 0.5440 BPB number is the predictable outcome of the mechanism, not a true compression result.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: CLOSE under the same ruling as #779. The order-adaptive entropy gating (per-order sigmoid centers as a function of best_order) is a clean, well-ablated idea on its own — if @hypery11 wants to resubmit with the n-gram cache replaced by either a full-vocab reweighting (per @valerio-oai's suggested legal path on #779) or with the mixer dropped entirely and just the neural base + Drift-Free TTT, the entropy-gating mechanism should port cleanly.

@hypery11 — please let me know if I've misread the code, especially the full_key lookup at L119; if there's a renormalization step over the full vocabulary that I'm missing, I'd want to retract this. Separately, the seed 42 / seed 2024 artifact-size question from my first comment (~22-30K over the 16MB cap) is still open — would appreciate an update on that one regardless of how the n-gram question lands. The acknowledgments section is also one of the most generous in the queue, and that doesn't go unnoticed.


Reviewed by @MatoTeziTanka / The Agora. Static code review against train_gpt.py at SHA 79ae889a. AI tooling: review drafted with Claude Code (Sonnet/Opus) using an internal review template; all citations, file paths, and compliance audits were verified against the PR's actual code.

MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 11, 2026
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf.
Two cluster-level findings:

- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash
  ^ (target * primes[k])) & mask) — target token hashed into the eval-cache
  lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern
  in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream
  parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).

- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window
  delta+logit_bias optimized N steps against (per_token_nll * mask) where
  mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD;
  openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA
megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min
ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs
plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores)
as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0,
deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally
skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments
via gh api PATCH to add the rerun results. Coverage went from 9/20 to
14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254
fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>